Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)


of the “q2-vsearch” plugin. The input is the most recent preprocessed artifact, “demux-yoga-merged.qza”.

qiime vsearch dereplicate-sequences \
  --i-sequences inputs/demux-yoga-merged.qza \
  --o-dereplicated-table inputs/derep-yoga-table.qza \
  --o-dereplicated-sequences inputs/derep-yoga-seqs.qza

The “dereplicate-sequences” command outputs two artifacts: (i) a feature table containing the observed abundance (frequency) of each feature in each sample of the study and (ii) feature data in which each feature identifier is mapped to its sequence.
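To make the idea of dereplication concrete, the following is a minimal sketch using only standard Unix tools, not QIIME2 itself. It assumes a toy input of one sequence per line (not FASTA or a .qza artifact), and the function name derep_counts is hypothetical. It collapses identical sequences into unique ones and reports how many times each was observed, which is conceptually what the feature table records per sample.

```shell
#!/bin/sh
# Toy dereplication: collapse identical sequences and count occurrences.
# Assumption: input is one sequence per line on stdin (hypothetical toy
# format, not FASTA and not a QIIME2 artifact).
derep_counts() {
  sort |                      # bring identical sequences together
    uniq -c |                 # count each unique sequence
    sort -k1,1nr -k2,2 |      # most abundant first, ties broken by sequence
    awk '{ print $2 "\t" $1 }'  # emit: sequence<TAB>count
}

# Six toy reads, three unique sequences.
printf 'ACGT\nACGT\nTTGA\nACGT\nTTGA\nGGCC\n' | derep_counts
```

Each output line holds a unique sequence and its abundance. In the real workflow, vsearch additionally assigns each unique sequence a feature identifier and tracks the counts separately for every sample.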

7.3.4.2.1.3 Clustering Methods

Clustering follows dereplication. Both the feature table and the feature data artifacts generated by dereplication are required as inputs for clustering. We will use QIIME2 to perform the three types of clustering: de novo, closed-reference, and open-reference.
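Because every clustering method needs both dereplication outputs, it can help to confirm that they exist before starting. The following is a small sketch of such a check; the function name require_artifacts is hypothetical, and the paths are the ones used in the dereplication command above.

```shell
#!/bin/sh
# Hypothetical pre-flight check: succeed only if every given file exists
# and is non-empty.
require_artifacts() {
  missing=0
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty artifact: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Paths follow the dereplication command in the text.
if require_artifacts inputs/derep-yoga-table.qza inputs/derep-yoga-seqs.qza; then
  echo "inputs ready for clustering"
else
  echo "run dereplicate-sequences first" >&2
fi
```

The check only tests for file presence; it does not validate that the files are well-formed QIIME2 artifacts.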

7.3.4.2.1.3.1 De Novo Clustering

De novo clustering does not require a reference database; it clusters features into groups based on sequence similarity alone. The similarity threshold can be set to 0.99, meaning that only reads at least 99% identical to a cluster's centroid sequence are allowed to join that cluster. QIIME2 uses the “cluster-features-de-novo” method of the “q2-vsearch” plugin to perform de novo clustering. The input artifacts are the feature table and feature data artifacts generated by the dereplication in the previous step. To keep the clustering results in a separate directory, we will create the “denovo” subdirectory.

mkdir denovo

qiime vsearch cluster-features-de-novo \
  --i-table inputs/derep-yoga-table.qza \
  --i-sequences inputs/derep-yoga-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table denovo/table-yoga-denovo.qza \
  --o-clustered-sequences denovo/rep-seqs-yoga-denovo.qza

The outputs are two artifacts: a feature table for the OTUs and feature data that contains the centroid sequences defining each OTU cluster. De novo clustering usually consumes more computational resources than the other two methods.
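The meaning of the 0.99 identity threshold can be illustrated with a simplified sketch. It assumes two sequences that are already aligned, of equal length, and free of gaps, and it counts matching positions; vsearch itself computes identity from a pairwise alignment, so this is only an approximation of the idea. The function names pairwise_identity and joins_cluster are hypothetical.

```shell
#!/bin/sh
# Simplified identity: fraction of matching positions between two
# equal-length, gap-free sequences (an assumption; vsearch aligns first).
pairwise_identity() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = length(a)
    if (n == 0 || n != length(b)) { print "NA"; exit 1 }
    m = 0
    for (i = 1; i <= n; i++)
      if (substr(a, i, 1) == substr(b, i, 1)) m++
    printf "%.4f\n", m / n
  }'
}

# Succeeds when the read ($2) is at least as similar to the centroid ($1)
# as the threshold ($3), e.g. 0.99.
joins_cluster() {
  id=$(pairwise_identity "$1" "$2") || return 2
  awk -v id="$id" -v t="$3" 'BEGIN { exit !(id >= t) }'
}

# Toy data: a 100-base centroid and reads with one and two mismatches.
centroid=$(awk 'BEGIN { for (i = 0; i < 100; i++) printf "A"; print "" }')
one_mismatch="C${centroid#A}"
two_mismatches="CC${centroid#AA}"

for candidate in "$one_mismatch" "$two_mismatches"; do
  identity=$(pairwise_identity "$centroid" "$candidate")
  if joins_cluster "$centroid" "$candidate" 0.99; then
    echo "identity $identity: joins the cluster"
  else
    echo "identity $identity: does not join this cluster"
  fi
done
```

With a 100-base centroid, a read with one mismatch has identity 0.99 and meets the threshold, while a read with two mismatches (0.98) does not.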

7.3.4.2.1.3.2 Closed-Reference Clustering

Closed-reference clustering requires a curated database of 16S rRNA gene sequences to serve as reference sequences. Only the representative sequences that match the database are clustered; those without a match are discarded. Examples of widely used databases include Greengenes (16S rRNA) at “https://greengenes.secondgenome.